3D Object Detection

Region-Aware Knowledge Distillation between Monocular Camera-based 3D Object Detectors

Overview

This paper proposes Region-Aware Knowledge Distillation (RAKD), a novel method to improve monocular 3D object detection by transferring knowledge from a stronger "expert" model to a lightweight "apprentice" model. Unlike traditional knowledge distillation approaches that rely on LiDAR or multi-camera inputs, RAKD works only with monocular camera data, making it more practical and cost-efficient.

Fig 1. The overall structure of our region-aware knowledge distillation for monocular-based 3D detection.

Multi-Scale Region Feature Alignment with Soft Weighting

Instead of rigid one-to-one matching at a single feature scale, RAKD performs multi-scale feature alignment across pyramid levels.
Introduces a soft assignment weight mechanism that uses a Gaussian distribution to model the relevance between an object's region and feature pyramid levels. This improves feature matching by accounting for object scale.

Performance & Efficiency:

Table 2. Comparisons with state-of-the-art 3D detectors on the KITTI test set. Expert (E) and apprentice (A) detectors applying for RAKD are shown in the last rows.
Latency measured by ours on the same single NVIDIA TITAN RTX is denoted with ∗. L and GF denote usuage of LiDAR and GFLOPs

Table 3. Comparisons with the recent PKD and SemCKD methods on the KITTI validation set. The best results and second-best scores are highlighted in bold and underlined (except for expert).
E and A also denote expert and apprentice detectors. All the latency is measured on the same machine. Avg. represents the score averaged for all classes and occlusion levels.

Ablation Study on the KITTI validation set. LRoI represents RoI feature alignment loss, HFA represents Hard feature assignment and SFA represents Soft feature assignment (LS RoI).
The best results are highlighted in bold, and the second-best results are underlined.